Add LLM analyzers #508
base: master
Conversation
Outstanding to-do items:
# The data summary is un-informative and verbose
if DataSummary in model.attributes:
    del model.attributes[DataSummary]
return serialize(make_serializable(model.attributes))
Isn't this essentially a re-implementation of what the PJSONSerializationService does? Is there a reason to not use it?
PJSON has types alongside the values. This is useful for deserializing PJSON values, but would make this code overly complicated, since we never need to deserialize the resulting text and thus would end up adding logic to drop the types anyway.
Moreover, PJSON is nice as a wire format, but is not optimized for human-readable text "serialization." We want to deviate from its representation of some values to make the resulting text easier for the LLM to understand.
I also (personally) just don't like the PJSON serialization service.
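To illustrate the distinction, here is a minimal sketch of the flattening idea. The helper names mirror the ones in the diff, but ExampleAttributes and the exact output shapes are illustrative assumptions, not OFRAK's real classes or PJSON's actual wire format:

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


@dataclass
class ExampleAttributes:
    isa: str = "ARM"
    bit_width: int = 32


def make_serializable(attributes: dict) -> dict:
    # Keep only plain key/value pairs, keyed by the attribute class name,
    # so the text handed to the LLM carries no serialization metadata.
    return {
        type(value).__name__: asdict(value) if is_dataclass(value) else value
        for value in attributes.values()
    }


def serialize(obj: dict) -> str:
    # Render the flattened attributes as human-readable text for the prompt.
    return json.dumps(obj, indent=2)


print(serialize(make_serializable({ExampleAttributes: ExampleAttributes()})))
# Prints:
# {
#   "ExampleAttributes": {
#     "isa": "ARM",
#     "bit_width": 32
#   }
# }
```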
Are we sure that the serialized PJSON would hinder the LLM analysis? This sounds like a hypothesis and not something that was tested.
We should, whenever possible, avoid writing ~80 lines of code if it can be avoided.
I could surely get something working with PJSON. The (admittedly untested) hypothesis is that the code to do so would be as long as this code, less clear than this code, and dependent on undocumented serialized PJSON structure, which may change in the future, necessitating changes here.
If you still think I should rewrite this to use PJSON, I will.
See comment above.
# Install Ollama
RUN curl -L "https://ollama.com/download/ollama-linux-""$TARGETARCH"".tgz" | tar -C /usr/ -xzv
Are we sure we want to add this to the Docker image?
This feels like an incomplete install step -- we're adding a dependency that requires the user to download/install more things to actually use it. The impact is a larger image size for all users, plus more work for users who actually want to use the LLM feature.
How would you propose doing it instead?
It would be ideal to have this whole thing in a separate package so that users could conditionally include the AI components. Then we could install Ollama and pull a model without worrying about unnecessarily inflating the core image. But one of the original constraints was that this should be in OFRAK core.
If we don't have Ollama installed, then we can't run the tests without having the tests themselves pull down and install the binary. As far as I know, in other OFRAK tests, there is no precedent for installing a dependency like this. It would be doable, but seems a little weird.
On the other hand, we could pull a model in the OFRAK core Dockerstub after installing Ollama, but this seems like a bad idea. We don't want to add another several-hundred-MB dependency to the OFRAK core images. We especially don't want to add models this large to the base image if the user is just going to use OpenAI anyway.
Installing Ollama, but not pulling a model, seemed like a reasonable tradeoff. If it seems wrong to you, I would appreciate guidance on a better way to do it.
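One possible middle ground for the test concern, sketched here purely as an illustration (the port, marker name, and use of requests are assumptions, not part of this PR): skip the LLM tests whenever no Ollama server is reachable, so core CI stays green without a model while developers who run Ollama still exercise the tests.

```python
import pytest
import requests

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address (assumption)


def ollama_available() -> bool:
    # Return True if a local Ollama server answers on the default port.
    try:
        return requests.get(OLLAMA_URL, timeout=2).status_code == 200
    except requests.RequestException:
        return False


# Decorate the LLM analyzer tests with this marker so core CI still passes
# when Ollama is not installed, while developers who have it run the tests.
requires_ollama = pytest.mark.skipif(
    not ollama_available(), reason="Ollama server is not running"
)
```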
ofrak_injector.discover(ofrak_ghidra)


async def test_llm_component(ofrak_context: OFRAKContext, model: str):
Can we use the https://github.com/redballoonsecurity/ofrak/blob/master/ofrak_core/test_ofrak/unit/component/analyzer/analyzer_test_case.py#L34 to run these tests?
Right now these tests are not validating that the analyzer successfully generates the needed attributes.
I can do that validation in these tests. It does not make sense to use an AnalyzerTests subclass here since the analyzer must be run with resource.run. It requires a config, and thus cannot be called with resource.analyze, as happens in AnalyzerTests methods.
Also, the tests do call get_attributes at the end, so they're validating that the attributes are added.
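For reference, the run-with-config-then-validate pattern described above might look roughly like this; LlmAnalyzer, LlmAnalyzerConfig, and LlmAttributes are placeholder names for illustration, not the actual identifiers in this PR:

```python
from ofrak import OFRAKContext


async def test_llm_component(ofrak_context: OFRAKContext, model: str):
    resource = await ofrak_context.create_root_resource("test_llm", b"\x00" * 64)

    # The analyzer requires a config, so it is invoked with resource.run(...)
    # rather than resource.analyze(...); this is why the AnalyzerTests base
    # class does not fit. LlmAnalyzer / LlmAnalyzerConfig / LlmAttributes are
    # placeholders for the real component, config, and attributes classes.
    await resource.run(LlmAnalyzer, LlmAnalyzerConfig(model=model))

    # Validate that the analyzer actually attached the expected attributes.
    attributes = resource.get_attributes(LlmAttributes)
    assert attributes is not None
```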
One sentence summary of this PR (This should go in the CHANGELOG!)
Add AI-enhanced LLM analyzers.
Anyone you think should look at this, specifically?
@whyitfor